OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Hybred: An OCR Document Representation for Classification Tasks

Internal identifier: 000344 (Main/Exploration); previous: 000343; next: 000345

Authors: Sami Laroum [France]; Nicolas Béchet [France]; Hatem Hamza [France]; Mathieu Roche [France]

Source: International Journal of Computer Science Issues (ISSN 1694-0784), 2011-05

RBID : Hal:lirmm-00723581

English descriptors

Abstract

The classification of digital documents is a complex task in a document analysis flow. The amount of documents resulting from the OCR retro-conversion (optical character recognition) makes the classification task harder. In the literature, different features are used to improve the classification quality. In this paper, we evaluate various features on OCRed and non OCRed documents. Thanks to this evaluation, we propose the HYBRED (HYBrid REpresentation of Documents) approach which combines different features in a single relevant representation. The experiments conducted on real data show the interest of this approach.

Url: http://hal-lirmm.ccsd.cnrs.fr/lirmm-00723581


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Hybred: An OCR Document Representation for Classification Tasks</title>
<author>
<name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-181" status="VALID">
<idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc>
<address>
<addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation>
<relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="UMR5506" active="#struct-410122" type="direct">
<org type="institution" xml:id="struct-410122" status="VALID">
<orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc>
<address>
<addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2446" status="OLD">
<idno type="RNSR">200318386B</idno>
<orgName>Usage-centered design, analysis and improvement of information systems</orgName>
<orgName type="acronym">AxIS</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-34586" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-86790" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-34586" type="direct">
<org type="laboratory" xml:id="struct-34586" status="VALID">
<idno type="RNSR">198318250R</idno>
<orgName>Inria Sophia Antipolis - Méditerranée </orgName>
<orgName type="acronym">CRISAM</orgName>
<desc>
<address>
<addrLine>2004 route des Lucioles BP 93 06902 Sophia Antipolis</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/sophia/</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23810" status="VALID">
<orgName>Itesoft R&amp;D</orgName>
<desc>
<address>
<addrLine>Parc d'Andron - Le Sequoïa 30470 Aimargues</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.itesoft.fr</ref>
</desc>
<listRelation>
<relation active="#struct-365824" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-365824" type="direct">
<org type="institution" xml:id="struct-365824" status="INCOMING">
<orgName>ITESOFT</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-392245" status="VALID">
<orgName>Exploration et exploitation de données textuelles</orgName>
<orgName type="acronym">TEXTE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr/recherche/equipes/texte</ref>
</desc>
<listRelation>
<relation active="#struct-181" type="direct"></relation>
<relation name="UMR5506" active="#struct-410122" type="indirect"></relation>
<relation name="UMR5506" active="#struct-441569" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-181" type="direct">
<org type="laboratory" xml:id="struct-181" status="VALID">
<idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc>
<address>
<addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation>
<relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-410122" type="indirect">
<org type="institution" xml:id="struct-410122" status="VALID">
<orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc>
<address>
<addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:lirmm-00723581</idno>
<idno type="halId">lirmm-00723581</idno>
<idno type="halUri">http://hal-lirmm.ccsd.cnrs.fr/lirmm-00723581</idno>
<idno type="url">http://hal-lirmm.ccsd.cnrs.fr/lirmm-00723581</idno>
<date when="2011-05">2011-05</date>
<idno type="wicri:Area/Hal/Corpus">000058</idno>
<idno type="wicri:Area/Hal/Curation">000058</idno>
<idno type="wicri:Area/Hal/Checkpoint">000086</idno>
<idno type="wicri:doubleKey">1694-0784:2011:Laroum S:hybred:an:ocr</idno>
<idno type="wicri:Area/Main/Merge">000349</idno>
<idno type="wicri:Area/Main/Curation">000344</idno>
<idno type="wicri:Area/Main/Exploration">000344</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">Hybred: An OCR Document Representation for Classification Tasks</title>
<author>
<name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-181" status="VALID">
<idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc>
<address>
<addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation>
<relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle name="UMR5506" active="#struct-410122" type="direct">
<org type="institution" xml:id="struct-410122" status="VALID">
<orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc>
<address>
<addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="direct">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-2446" status="OLD">
<idno type="RNSR">200318386B</idno>
<orgName>Usage-centered design, analysis and improvement of information systems</orgName>
<orgName type="acronym">AxIS</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
<listRelation>
<relation active="#struct-34586" type="direct"></relation>
<relation active="#struct-300009" type="indirect"></relation>
<relation active="#struct-86790" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-34586" type="direct">
<org type="laboratory" xml:id="struct-34586" status="VALID">
<idno type="RNSR">198318250R</idno>
<orgName>Inria Sophia Antipolis - Méditerranée </orgName>
<orgName type="acronym">CRISAM</orgName>
<desc>
<address>
<addrLine>2004 route des Lucioles BP 93 06902 Sophia Antipolis</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/sophia/</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle active="#struct-300009" type="indirect">
<org type="institution" xml:id="struct-300009" status="VALID">
<orgName>Institut National de Recherche en Informatique et en Automatique</orgName>
<orgName type="acronym">Inria</orgName>
<desc>
<address>
<addrLine>Domaine de Voluceau, Rocquencourt - BP 105, 78153 Le Chesnay Cedex</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/en/</ref>
</desc>
</org>
</tutelle>
<tutelle active="#struct-86790" type="direct">
<org type="laboratory" xml:id="struct-86790" status="VALID">
<idno type="RNSR">196718247G</idno>
<orgName>INRIA Paris-Rocquencourt</orgName>
<desc>
<address>
<addrLine>INRIA Rocquencourt : Domaine de Voluceau, Rocquencourt B.P. 105 78153 le Chesnay Cedex / INRIA Paris - 23 avenue d'Italie 75013 Paris</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.inria.fr/centre/paris-rocquencourt</ref>
</desc>
<listRelation>
<relation active="#struct-300009" type="direct"></relation>
</listRelation>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<affiliation wicri:level="1">
<hal:affiliation type="laboratory" xml:id="struct-23810" status="VALID">
<orgName>Itesoft R&amp;D</orgName>
<desc>
<address>
<addrLine>Parc d'Andron - Le Sequoïa 30470 Aimargues</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.itesoft.fr</ref>
</desc>
<listRelation>
<relation active="#struct-365824" type="direct"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-365824" type="direct">
<org type="institution" xml:id="struct-365824" status="INCOMING">
<orgName>ITESOFT</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
<author>
<name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
<affiliation wicri:level="1">
<hal:affiliation type="researchteam" xml:id="struct-392245" status="VALID">
<orgName>Exploration et exploitation de données textuelles</orgName>
<orgName type="acronym">TEXTE</orgName>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr/recherche/equipes/texte</ref>
</desc>
<listRelation>
<relation active="#struct-181" type="direct"></relation>
<relation name="UMR5506" active="#struct-410122" type="indirect"></relation>
<relation name="UMR5506" active="#struct-441569" type="indirect"></relation>
</listRelation>
<tutelles>
<tutelle active="#struct-181" type="direct">
<org type="laboratory" xml:id="struct-181" status="VALID">
<idno type="RNSR">199111950H</idno>
<orgName>Laboratoire d'Informatique de Robotique et de Microélectronique de Montpellier</orgName>
<orgName type="acronym">LIRMM</orgName>
<date type="start">1995</date>
<desc>
<address>
<addrLine>CC 477, 161 rue Ada, 34095 Montpellier Cedex 5</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.lirmm.fr</ref>
</desc>
<listRelation>
<relation name="UMR5506" active="#struct-410122" type="direct"></relation>
<relation name="UMR5506" active="#struct-441569" type="direct"></relation>
</listRelation>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-410122" type="indirect">
<org type="institution" xml:id="struct-410122" status="VALID">
<orgName>Université de Montpellier</orgName>
<orgName type="acronym">UM</orgName>
<desc>
<address>
<addrLine>163 rue Auguste Broussonnet - 34090 Montpellier</addrLine>
<country key="FR"></country>
</address>
<ref type="url">http://www.umontpellier.fr/</ref>
</desc>
</org>
</tutelle>
<tutelle name="UMR5506" active="#struct-441569" type="indirect">
<org type="institution" xml:id="struct-441569" status="VALID">
<idno type="IdRef">02636817X</idno>
<idno type="ISNI">0000000122597504</idno>
<orgName>Centre National de la Recherche Scientifique</orgName>
<orgName type="acronym">CNRS</orgName>
<date type="start">1939-10-19</date>
<desc>
<address>
<country key="FR"></country>
</address>
<ref type="url">http://www.cnrs.fr/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>France</country>
</affiliation>
</author>
</analytic>
<series>
<title level="j">International Journal of Computer Science Issues</title>
<idno type="ISSN">1694-0784</idno>
<imprint>
<date type="datePub">2011-05</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="mix" xml:lang="en">
<term>Data Mining</term>
<term>Information Retrieval</term>
<term>OCR</term>
<term>Text Mining</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">The classification of digital documents is a complex task in a document analysis flow. The amount of documents resulting from the OCR retro-conversion (optical character recognition) makes the classification task harder. In the literature, different features are used to improve the classification quality. In this paper, we evaluate various features on OCRed and non OCRed documents. Thanks to this evaluation, we propose the HYBRED (HYBrid REpresentation of Documents) approach which combines different features in a single relevant representation. The experiments conducted on real data show the interest of this approach.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
</list>
<tree>
<country name="France">
<noRegion>
<name sortKey="Laroum, Sami" sort="Laroum, Sami" uniqKey="Laroum S" first="Sami" last="Laroum">Sami Laroum</name>
</noRegion>
<name sortKey="Bechet, Nicolas" sort="Bechet, Nicolas" uniqKey="Bechet N" first="Nicolas" last="Béchet">Nicolas Béchet</name>
<name sortKey="Hamza, Hatem" sort="Hamza, Hatem" uniqKey="Hamza H" first="Hatem" last="Hamza">Hatem Hamza</name>
<name sortKey="Roche, Mathieu" sort="Roche, Mathieu" uniqKey="Roche M" first="Mathieu" last="Roche">Mathieu Roche</name>
</country>
</tree>
</affiliations>
</record>
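
As an illustration of how a record like the one above might be read programmatically, here is a minimal Python sketch that pulls out the title, the author names, and the RBID. It is not part of Dilib. It rests on several assumptions: the record uses the `wicri:` and `hal:` prefixes without declaring them, so the sketch wraps the input in a hypothetical `<wrapper>` element that binds them to placeholder URIs; the `RECORD` string is a shortened excerpt (two of the four authors, affiliations trimmed); and the function name `summarize` is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace bindings: the placeholder URIs below are assumptions,
# added only so that a strict XML parser accepts the wicri: and hal: prefixes.
WRAPPER_OPEN = (
    '<wrapper xmlns:wicri="urn:example:wicri" '
    'xmlns:hal="urn:example:hal">'
)
WRAPPER_CLOSE = "</wrapper>"

# Shortened excerpt of the record (two authors, affiliations trimmed).
RECORD = """
<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Hybred: An OCR Document Representation for Classification Tasks</title>
<author>
<name sortKey="Laroum, Sami" first="Sami" last="Laroum">Sami Laroum</name>
<affiliation wicri:level="1"><country>France</country></affiliation>
</author>
<author>
<name sortKey="Roche, Mathieu" first="Mathieu" last="Roche">Mathieu Roche</name>
<affiliation wicri:level="1"><country>France</country></affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Hal:lirmm-00723581</idno>
</publicationStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>
"""


def summarize(record_xml: str) -> dict:
    """Return the title, author names, and RBID found in a record string."""
    root = ET.fromstring(WRAPPER_OPEN + record_xml + WRAPPER_CLOSE)
    # Use the first titleStmt; sourceDesc repeats the same authors further down.
    title_stmt = root.find(".//titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("author/name")]
    rbid = root.findtext(".//publicationStmt/idno[@type='RBID']")
    return {"title": title, "authors": authors, "rbid": rbid}


summary = summarize(RECORD)
print(summary["title"])
print(summary["authors"])
print(summary["rbid"])
```

Restricting the author lookup to `titleStmt` avoids counting each author twice, since the full record lists the same names again under `sourceDesc` and in the `affiliations` tree.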

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000344 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000344 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Hal:lirmm-00723581
   |texte=   Hybred: An OCR Document Representation for Classification Tasks
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024